Frontiers in Bioinformatics — Latest Matching Preprints

1

ProtAug: An Empirical Investigation of pLM-Guided Data Augmentation for Protein Sequence Prediction Tasks

Chen, Z.; Wang, R.; Luo, Q.

2026-07-11 bioinformatics 10.64898/2026.07.10.737545 medRxiv

Top 0.1%

4.3%

Protein language models (pLMs) offer great potential for protein sequence analysis, yet the scarcity of labeled data often limits their effectiveness in fine-tuning. Data augmentation is a promising remedy, but systematic evaluation of augmentation strategies for protein sequences remains limited, and the conditions under which augmentation confers downstream benefits are not well understood. In this paper, we systematically investigate pLM-guided substitution-based augmentation across seven protein prediction tasks. We propose ProtAug, a framework that leverages encoder-based (ESM-2) and autoregressive (ProtGPT2) pLMs to generate augmented sequences with user-controlled variation levels. Our investigation focuses on four questions: (Q1) whether pLM-synthesized sequences preserve more original signals than simpler methods, (Q2) to what extent augmentation improves prediction performance, (Q3) how variation levels affect downstream accuracy across tasks and models, and (Q4) whether biological plausibility is a necessary condition for achieving improvement. Our experimental results show that: (1) ProtAug Esm generally preserves motifs and structural similarity better than simple substitution, often comparable to homology retrieval; (2) augmentation yields consistent but task-dependent improvements, with ProtAug Esm achieving the best or second-best performance in 5 out of 7 tasks at 10% variation; (3) low-to-moderate variation levels (2-30%) perform best overall, although high-variation augmentation can benefit certain structure-related tasks; (4) the necessity of biological plausibility is task- and variation-dependent: while semantic preservation correlates with performance at low-to-moderate variation levels, improved generalization at high variation levels suggests that regularization effects, rather than label preservation, can also drive performance gains.
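To make the notion of a "variation level" concrete, the sketch below implements the kind of simple random substitution that the abstract names as a baseline. It is not ProtAug itself: the pLM-guided proposal step is not reproduced. The function names, the uniform choice of replacement residues, and the example sequence are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def substitute(sequence, variation, rng):
    """Return a copy of `sequence` with round(variation * len) positions mutated.

    Assumption: replacements are drawn uniformly at random, which is a naive
    baseline; a pLM-guided method would rank candidates with a language model.
    """
    n_sites = round(variation * len(sequence))
    positions = rng.sample(range(len(sequence)), n_sites)
    residues = list(sequence)
    for pos in positions:
        # Exclude the original residue so every chosen site really changes.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
original = "MKTAYIAKQR" * 5  # hypothetical 50-residue sequence
augmented = substitute(original, 0.10, rng)
print(augmented, hamming_fraction(original, augmented))
```

Because each selected site is forced to change, the realised variation equals the requested level up to rounding, which keeps comparisons across variation levels interpretable.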

2

Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course

Yin, H.; Liu, W.; Zhou, W.; Chang, Z.; Carpenter, E. J.; Satyajith, A.; Haregu, S.; Greiner, R.; Derda, R.

2026-06-24 bioinformatics 10.64898/2026.06.19.733466 medRxiv

Top 0.1%

3.3%

Glycans coat the surface of all cells, and every glycan is recognised by specific glycan-binding proteins (GBPs). There are no general tools that can accurately estimate the binding strength between a glycan and a GBP from the amino acid sequence of the GBP and the molecular structure of the glycan, represented as a SMILES string. We describe models for predicting such binding strengths developed as a part of a Capstone Course at the University of Alberta. The models are trained on a dataset that combines BindingDB, a published database of small-molecule protein interactions, and data from glycan arrays measured by the Consortium for Functional Glycomics (CFG). In this hybrid dataset of protein-ligand interactions the ligands are both glycans from CFG and small molecules from BindingDB; similarly, proteins include GBPs and proteins from BindingDB. Three models are presented: (i) ProMax, which fuses ESM-2, MolFormer, and MolCLR features; (ii) APEX, which constrains learning to a predetermined form, a physical model of binding; and (iii) UltraMax, which adds inter-atomic distances for the ligands. To address the dataset's severe long-tail distribution, the models employ tail-aware losses for rare high-binding instances. Trained and evaluated on approximately one million protein-ligand pairs using hold-out splits for unseen molecules, the three models provide a unified framework for quantitative glycan-protein binding prediction. We observed that learning glycan-protein binding is harder than the similar task of learning small-molecule-protein interactions. Simple mirror-inversion tests led us to postulate that insufficient use of chiral features is an important source of difficulty in learning these interactions.

3

Capabilities, specificity gaps and training-data dependence of AlphaFold3 across diverse application areas

Follonier, O.; Liu, Y.; Campomanes, P.; Lafrenaye, L.; Racle, J.; Alvarez, D.; van Gerwen, J.; Heinzmann, R.; Jänes, J.; Kummelstedt, E.; Durairaj, J.; Gfeller, D.; Vanni, S.; Beltrao, P.

2026-07-13 bioinformatics 10.64898/2026.07.13.738147 medRxiv

Top 0.2%

2.7%

Structure prediction models have moved from single proteins to assemblies that include diverse biomolecules and their modifications. AlphaFold3 (AF3) and related models extended structural modelling via an all-atom framework, opening many new potential applications in structural biology. We evaluate how well the new capabilities of AF3 translate into application tasks in diverse areas: prediction of ubiquitinated protein structures, T-cell receptor (TCR)-epitope recognition, antibody-antigen complexes, protein-RNA and protein-lipid interactions. We find that, while AF3 can perform well in favourable settings, this performance is uneven across applications. In RNA-target predictions, the model confidence fails to separate genuine from decoy interaction partners, and in several tasks accuracy depends on the presence of related complexes in the training set. Taken together, our assessment is more cautious than for AF2, whose gains in modelling monomers and complexes were clear and broadly generalisable. AF3's extension to new biomolecule types shows less consistent performance and generalisation. AF3 can be a powerful tool for hypothesis generation and prioritisation, but its predictions and use of confidence metrics will depend strongly on the specific application area and must be interpreted with respect to training-set overlap. We expect that the benchmarks provided here will serve for testing of future developments in the structure prediction field.

4

Can a Tissue-derived Progression Signature Accurately Predict Colorectal Cancer Stage Transitions in Blood?

Sarkar, P.; Sarkar, P.

2026-06-29 bioinformatics 10.64898/2026.06.23.734006 medRxiv

Top 0.2%

2.5%

Colorectal cancer (CRC) is challenging to track because its molecular changes are very complex as the disease progresses, creating significant challenges for robust biomarker discovery. In this study, we developed a machine learning framework by integrating monotonic progression and the StepMiner approach. We conducted external validation to identify reproducible, consistent transcriptomic biomarkers associated with CRC progression. Gene expression datasets were analyzed across four disease states from publicly available GEO: normal colon, adenoma, primary colorectal cancer, and metastasis. First, we identified genes with monotonic expression, then used the StepMiner approach to identify genes that act as switches between stages. A balanced 74-gene signature was used for machine-learning classification with a Random Forest. External validation showed strong performance in tissue-based datasets. However, tissue-derived signatures performed poorly on plasma and blood-based datasets, highlighting biological differences between tissue and blood transcriptomic profiles. Cross-filtering between tissue-derived genes and blood expression datasets was performed, which resulted in the selection of 62 blood-compatible gene signatures. Leakage-free retraining on GSE164191 achieved a mean AUC of 0.868 with balanced precision. Functional enrichment analysis showed that these genes are highly active in cancer growth. Specifically, genes CBX3, S100A11, PDK4, NCOR1, and SOX4 demonstrated stable and reliable performance across the validation fold. Overall, our study presents a progression-aware transcriptomic framework for CRC biomarker discovery and demonstrates the importance of external validation. Additionally, we evaluate whether tissue-derived signatures can predict blood profiles. This proposed approach may help the future development of tissue-based diagnostics and minimally invasive liquid-biopsy strategies for CRC.
To ensure reproducibility, our proposed workflow was automated as a Nextflow pipeline. The tissue-derived model was deployed as an application utilizing Angular, ASP.NET Core, and Plumber (R).

5

Performance Attribution in pLM-Based Biological Relation Prediction

Zhu, K.; Zhao, W.; Zhang, Y.; Xia, Z.

2026-07-02 bioinformatics 10.64898/2026.06.28.735130 medRxiv

Top 0.3%

2.4%

Protein language models (pLMs) have enabled strong benchmark performance in biological relation-prediction tasks, but aggregate metrics do not identify which sources of information support that performance. We examined three case studies (MetaESI, DeepGNHV, and SAGEPhos) using frozen pLM full-input baselines, endpoint- or site-restricted controls, clean train-derived prior controls, and a restriction-matched selector control. Frozen pLM-derived inputs coupled to generic downstream learners reached AUROC/AUPRC of 0.827/0.703 for the current MetaESI full-input rerun, 0.922/0.708 for a DeepGNHV two-endpoint baseline, and 0.896/0.893 for SAGEPhos. Restricted controls retained task-dependent signal. Most notably, a self-label-excluding MetaESI endpoint-frequency control reached 0.846/0.676 under the row split, numerically close to the frozen ESM2 full-input reference despite using no sequence embeddings. Clean full-catalog DeepGNHV endpoint priors and SAGEPhos kinase/substrate/site-window priors provided supplementary diagnostics rather than direct architecture-contribution estimates. In a separate fixed frozen-pooling LightGBM experiment, GARD-selected pooling did not show higher observed performance than count-matched random token pooling. Endpoint-cold diagnostics showed performance degradation, while train-label shuffling returned discrimination to approximately chance; hard- or matched-negative and family- or homology-aware evaluations were not available across the case studies. These findings do not diagnose leakage, imply memorization, exclude biological learning, or invalidate the evaluated models. Rather, they show that benchmark utility and performance attribution are distinct: architecture-specific, relation-specific, selector-specific, and generalization claims require controls matched to the interpretation being made.

6

The Gompertz curve for estimating growth rates of Protein Data Bank and protein folds

Sato, K.; Tomii, K.

2026-06-26 bioinformatics 10.64898/2026.06.24.732253 medRxiv

Top 0.3%

2.1%

The Protein Data Bank (PDB) is an ever-growing, open-access repository of structural data of biological molecules. This international database has been instrumental in the development of artificial intelligence and deep learning models for protein structure prediction and design. The PDB growth is a crucially important factor influencing further development of these models. Therefore, after analyzing the growth trend in PDB depositions since the archive's launch, we found that it is well fitted by the Gompertz function, a growth curve used across various disciplines. Furthermore, we observed that the function captures the "discovery of novel folds", i.e., the cumulative number of distinct folds among protein domains that constitute most of the PDB. Consequently, based on the fitting results, we estimated the likely numbers of PDB entries and protein folds. These findings provide insights into deceleration of growth in recent years and enable us to assess anticipated trends.
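For readers unfamiliar with the Gompertz function, the sketch below writes out one common parameterisation, N(t) = K * exp(-b * exp(-c * t)), together with its growth rate and inflection point. The parameter values are purely illustrative and are not the fitted values from the preprint; the exact parameterisation the authors used is also an assumption here.

```python
import math


def gompertz(t, K, b, c):
    """Cumulative count at time t: N(t) = K * exp(-b * exp(-c * t)).

    K is the asymptotic ceiling, b shifts the curve along the time axis,
    and c sets how quickly growth decelerates.
    """
    return K * math.exp(-b * math.exp(-c * t))


def growth_rate(t, K, b, c):
    """Analytic derivative dN/dt = c * N * ln(K / N)."""
    n = gompertz(t, K, b, c)
    return c * n * math.log(K / n)


def inflection_time(b, c):
    """Time of fastest growth, where N = K / e; growth decelerates afterwards."""
    return math.log(b) / c


# Illustrative (hypothetical) parameters, not fitted to PDB data.
K, b, c = 1000.0, 5.0, 0.1
t_star = inflection_time(b, c)
print(t_star, gompertz(t_star, K, b, c), growth_rate(t_star, K, b, c))
```

In this form the estimated ceiling K is what yields a projected eventual number of entries or folds, and a present date later than the inflection time corresponds to the decelerating regime the abstract describes.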

7

Real Science Is Harder Than Benchmarks: Evaluating Advanced AI Frameworks on Published Studies. I. Uncertainty Quantification, ML on Therapeutic Data Commons, and Agent-Based Modeling

Ahmed, M. O.; Amale, S. A.; Bhavsar, R. D.; Chopra, P.; Jaimes, A.; Kachhwah, A.; Kalotra, C. D.; Li, P.; Li, X.; Liao, Y.; Roy, R.; Senthilselvan, N.; Shao, Y.; Sharma, A. D.; Shrivatsan, A.; Xue, R.; You, Y.; Badkul, A.; Xie, L.; Oet, M.; Lee, K.; Sinitskiy, A.

2026-06-27 bioinformatics 10.64898/2026.06.24.734302 medRxiv

Top 0.3%

2.1%

Artificial Intelligence (AI) frameworks for automating scientific research have shown strong performance on benchmarks, but their capacity to routinely reproduce results from multiple real-life published studies remains largely untested. We evaluated five advanced AI research frameworks (Kosmos, K-Dense, ToolUniverse, BioAgents from bio.xyz, and the AI Scientist-v2 from Sakana AI) on three real-life tasks (including two recently published papers) spanning uncertainty quantification for molecular property predictions, machine learning on Therapeutic Data Commons benchmarks, and agent-based modeling. AI frameworks demonstrated genuine strengths: generating original hypotheses, competently executing routine data acquisition and coding tasks, providing statistical measures of confidence often absent from the original papers, and producing well-formatted final reports. At the same time, our experiments revealed that real-world scientific tasks remain considerably harder than current benchmarks suggest. No AI framework matched the scope or depth of the original studies, results varied across multiple runs of the same framework with the same prompt, and we documented cases of severe hallucinations in final reports, gaps in literature coverage, and overconfident conclusions. Verification of AI outputs required substantial domain expertise. While these three tasks are only partially representative of the broader scientific landscape, they offer a starting point for developing a more rigorous methodology for evaluation of AI performance than what is currently practiced. We conclude that AI frameworks are already valuable for prototyping research directions and stress-testing completed studies, and some of the limitations documented here appear largely tractable through infrastructure improvements and continued development.

8

The Hidden Disorder Divide: Reconciling Benchmark Inconsistencies in Intrinsically Disordered Protein Binding Site Prediction

Malhis, N.; Mehdiabadi, M.; Erdos, G.; Gsponer, J.; Kurgan, L.; Tosatto, S. C. E.; Dosztanyi, Z.; Piovesan, D.

2026-06-27 bioinformatics 10.64898/2026.06.24.733783 medRxiv

Top 0.4%

1.7%

Computational predictors of protein-binding sites within intrinsically disordered regions (IDRs) show highly inconsistent performance across high-quality benchmark datasets. To understand the origins of these discrepancies, we systematically compared predictors across three independent test sets: two CAID datasets updated with the latest DisProt annotations and a composite dataset (DBs) assembled from DIBS, FuzDB, IDEAL, and MFIB. Predictors trained predominantly on DisProt data achieved substantially higher AUCs on the CAID sets but performed poorly on the DBs. In contrast, predictors trained on older, low-quality PDB-based datasets showed balanced performance across all sets, with a slight preference for DBs. Predictors with mixed training exposure displayed intermediate behavior. Through controlled experiments using identical CNN architectures and feature analysis, we demonstrate that the dominant factor driving these performance differences is the intrinsic disorder propensity of the binding sites themselves. Binding residues in DisProt-based datasets exhibit markedly higher average disorder propensity scores than those in PDB-derived datasets. This previously unrecognized selection bias -- literature studies preferentially characterizing more disordered binding sites, while PDB-derived annotations capture less disordered ones -- effectively splits IDR-protein binding sites into two distinct categories. Predictors optimized on one category therefore generalize poorly to the other. Binding-site length and sequence conservation play only minor or negligible roles in explaining the observed inconsistencies. These findings highlight a critical limitation in current benchmarking practices and training strategies for IDR-binding site prediction, underscoring the need for more balanced and disorder-aware reference datasets. Finally, the diagnostic techniques introduced here could prove valuable beyond the specific application examined in this study.

9

Citrulline and Faecal Elastase 1 as a Combined Diagnostic Biomarker for Pancreatic Ductal Adenocarcinoma

Niazi, U.; Roberts, C. A.; McDonnell, D.; Goss, V. M.; Afolabi, P. R.; Swann, J. R.; Byrne, C. D.; Griffiths, G. O.; Hamady, Z. Z.

2026-07-19 oncology 10.64898/2026.07.16.26358209 medRxiv

Top 0.4%

1.7%

Background: Early detection of pancreatic ductal adenocarcinoma (PDAC) is critical. While faecal elastase-1 (FE-1) is a standard clinical marker for pancreatic function, its diagnostic accuracy for malignancy is limited. We sought to identify plasma metabolites that enhance FE-1 performance in symptomatic "at-risk" patients. Methods: Using the DEPEND cohort (CRUK C45617/A29908), plasma metabolomics was performed on patients with resectable PDAC (n=23) and healthy volunteers (n=24). Predictive modelling included feature selection and cross-validation, with further validation in an independent external cohort. Results: Citrulline was identified as significantly depleted in PDAC patients across discovery and validation cohorts. In isolation, Citrulline achieved an AUC of 0.86 (internal) and 0.88 (external validation). Standalone FE-1 demonstrated an AUC of 0.67. However, combining Citrulline and FE-1 significantly improved diagnostic performance, achieving a combined AUC of 0.96. Stratification revealed distinct metabolomic signatures associated with poorly differentiated tumours, suggesting a link to histological grade. Conclusions: Integrating Citrulline with FE-1 testing substantially improves PDAC detection in symptomatic patients. This non-invasive panel offers high diagnostic potential, though prospective validation is required to establish clinical cut-offs for routine practice.

10

Mapping Topic Change in Influential Hepatocellular Carcinoma Research: A Two-Cohort Bibliometric Analysis

Su, Z.; Li, T.

2026-07-16 oncology 10.64898/2026.07.07.26357427 medRxiv

Top 0.5%

1.7%

The therapeutic landscape for hepatocellular carcinoma (HCC) is evolving rapidly, necessitating scalable approaches to synthesize the expanding scientific literature. We characterized thematic shifts in HCC treatment and prognosis research by conducting a retrospective bibliometric analysis of influential publications from 2023 and 2024. Using the OpenAlex database, we identified the 50 most highly cited papers from each year based on eighteen-month post-publication citation counts. Large language models were deployed to extract, normalize, and classify concepts from unstructured text into canonical topics and parent themes, enabling quantitative year-over-year frequency comparisons. Analysis of these 100 papers revealed a distinct maturation in research focus. Although broad categories like general immunotherapy remained prevalent, their relative frequency declined in favor of specific dual immune checkpoint regimens, notably CTLA-4 inhibition and the durvalumab plus tremelimumab combination. Concurrently, parent themes related to radiomics, imaging, and health systems exhibited significant growth in the 2024 cohort. These findings demonstrate a thematic transition in high-impact HCC research from foundational immuno-oncology toward optimized combination therapies and precision diagnostics. Furthermore, this study highlights the utility of artificial intelligence-driven bibliometrics for objectively tracking dynamic conceptual shifts in oncology. A web interface for exploring the data is available at https://pri.pepkio.com/.

11

Integrated Plasma and Urinary Cell-free DNA Profiling Enables Noninvasive Molecular Detection from Ta to T4 Bladder Cancer

Riediger, A. L.; Schindler, I.; Heller, M.; Huber, J.; Sueltmann, H.; Goertz, M.

2026-07-15 oncology 10.64898/2026.07.13.26357430 medRxiv

Top 0.6%

1.5%

Background and Objective: Due to the heterogeneity of bladder cancer, minimally invasive molecular profiling may improve tumor characterization at the time of diagnosis. We evaluated whether integrated genomic and fragmentomic profiling of plasma and urinary circulating tumor DNA (ctDNA) detects BC-derived signals for diagnosis and disease stratification across all tumor stages. Methods: In this real-world cohort, 202 plasma and urine samples were obtained from 33 patients with non-muscle-invasive BC (NMIBC), mostly Ta tumors, and 15 patients with muscle-invasive BC (MIBC), as well as from 58 cancer-free controls. Low-coverage whole-genome sequencing was performed to assess ctDNA fragmentation, chromosomal instability and copy number variations. Matched tumor tissue was analyzed to evaluate concordance between liquid biopsy and tissue-derived molecular alterations. Key Findings and Limitations: Complementary genomic and fragmentomic profiling of cfDNA achieved detection rates of 75.8% in NMIBC patients and 91.7% in MIBC patients with paired plasma and urine. Distinct differences were observed between MIBC, NMIBC and cancer-free controls, consistent with increasing ctDNA signals during disease progression. Tumor tissue analysis confirmed BC-associated molecular alterations. Limitations include the single-center design and limited sample size. Conclusions and Clinical Implications: Multimodal profiling of plasma and urinary cfDNA enabled the detection of tumor-derived molecular signals for all bladder cancer stages, including early-stage disease. By integrating genomic and fragmentomic features, this minimally invasive approach provides molecular tumor characterization at the time of diagnosis and may support future risk-adapted diagnostic, therapeutic and surveillance strategies.

12

Glomerular Hyperfiltration, Charge Selectivity, and the Low-Dimensional Structure of Glomerular Transport

Öberg, C. M.

2026-06-28 physiology 10.64898/2026.06.23.733946 medRxiv

Top 0.6%

1.4%

Background: The relative contributions of molecular size, electrostatic charge, and filtration rate to glomerular transport remain controversial. We hypothesized that glomerular sieving data contain a limited number of underlying transport modes that can be identified directly from experimental measurements. Methods: Glomerular sieving coefficients were measured in anesthetized rats using neutral and anionic polysucrose during baseline conditions and glucagon-induced hyperfiltration. Data were analyzed using aligned-rank two-factor ANOVA, nonlinear mixed-effects regression of an electrostatic distributed two-pore model, pairwise correlation analysis, and principal component analysis. Results: Hyperfiltration reduced the sieving of small and intermediate polysucrose molecules, whereas anionic polysucrose exhibited lower sieving coefficients than neutral polysucrose over a broad range of molecular sizes. An electrostatic distributed two-pore model accurately reproduced the observed effects of filtration rate and molecular charge and yielded an effective pore-wall charge density of 5.4 mC/m² (95% confidence interval, 4.5 to 6.6). Pairwise correlation analysis revealed strong coupling between neighboring molecular sizes throughout the entire measured size range. Principal component analysis of the 2.5-8.0 nm size-selective region showed that the first principal component explained 96.3% of the variance and the first two principal components explained 99.9% of the variance. Separate analyses of the 2.5-5.0 nm and 5.0-8.0 nm transport regions showed that the first principal component explained 99.4% and 89.5% of the variance, respectively. Conclusions: Glomerular sieving curves exhibited a highly constrained low-dimensional structure despite differences in molecular charge, filtration rate, and individual animals. The observed transport structure was consistent with distinct small-pore and large-pore transport domains and enabled highly effective principal component-based denoising of experimental sieving data.

13

A foundation model enables prediction of natural product molecular properties, bioactivity, and structural similarity from biosynthetic gene cluster sequence

Walker, A.

2026-07-07 bioinformatics 10.64898/2026.07.05.736569 medRxiv

Top 0.7%

1.4%

Genome mining is a powerful technique in natural product discovery, where biosynthetic gene clusters that are likely to produce novel or desirable natural products are identified through bioinformatic analysis. There are many more predicted biosynthetic gene clusters than can easily be experimentally characterized. Additional computational methods to prioritize biosynthetic gene clusters by the bioactivity, structural properties, or novelty of the product would make genome mining more efficient. Multiple machine learning/artificial intelligence models have been developed to predict product properties from biosynthetic gene cluster sequence, but they are limited by small quantities of training data. Model pretraining with unlabeled data is a powerful technique to develop models that can learn on a limited amount of labeled training data. Biosynthetic gene clusters are well suited to this strategy because there are many predicted clusters with only a small percentage being characterized. This paper reports BGC-MLM, a foundation model that is pretrained with a masked language task on predicted biosynthetic gene clusters and then fine-tuned for downstream applications including prediction of product structural class, bioactivity, chemical properties, counts of functional groups, and chemical fingerprint. Comparison to a model trained without pretraining shows that pretraining generally improves performance. BGC-MLM shows better or similar performance to existing specialized methods for these tasks, demonstrating its utility as a foundation model for natural product genome mining.

14

Benchmarking AlphaFold and related deep learning approaches for modeling antibody and TCR antigen recognition

Yin, R.; Saravanakumar, S.; Shi, S. Y.; Park, M.; Lin, V.; Lee, J.; Cheung, M.; Felbinger, N.; Kaufman, S.; Eisenberg, M.; Pierce, B.

2026-07-06 bioinformatics 10.64898/2026.07.04.736425 medRxiv

Top 0.7%

1.3%

Determining the structural basis of antigen recognition by antibodies and T cell receptors (TCRs) provides critical insights into effective immune targeting and can inform design of biotherapeutics and vaccines. Accurate computational modeling of antibodies and TCRs in complex with their targets poses a major challenge for predictive methods, including AlphaFold, which is generally accurate for modeling protein complexes but has shown limited success for immune recognition. In this study we assessed the performance of AlphaFold2, AlphaFold3, increased sampling protocols, and related deep learning methods for modeling antibody-protein, antibody-peptide, and TCR-peptide-major histocompatibility complex (pMHC) recognition. We show that increased sampling and AlphaFold3 generally improve performance relative to default sampling and AlphaFold2; however, predictive accuracy and improvement levels varied considerably among interface classes, with antibody-peptide complexes representing a challenge despite their small antigen size. Comparing per-case success across methods showed some complementarity, indicating opportunities for increased success through model pooling approaches, for instance increasing antibody-peptide near-native success from 41% to 59%. Analysis of AlphaFold confidence scores and modeling of a noncanonical complex provided further insights into predictive performance. These results highlight considerations for predictive antibody and TCR complex modeling efforts, while revealing key distinctions among protocols, scoring, and immune complex classes.

15

Application of class-balancing algorithms to diverse plasma metabolomics datasets using brain tumor as an example

Godlewski, A.; Solowiej, K.; Mojsak, P.; Godzien, J.; Zelkowska, J.; Kretowski, A.; Lyson, T.; Burdukiewicz, M.; Kaminski, K.; Ciborowski, M.

2026-07-07 bioinformatics 10.64898/2026.07.02.735756 medRxiv

Top 0.7%

1.2%

Class imbalance remains a challenge in metabolomics research, where biological and technical variability can affect statistical inference and machine learning (ML) performance. Class-balancing algorithms address this issue by either increasing minority-class observations or reducing the number of majority-class samples. This study evaluated the impact of oversampling and undersampling algorithms on targeted and untargeted metabolomics datasets derived from LC-MS and GC-MS analyses of plasma samples from patients with glioblastoma, meningioma, and controls. Synthetic Minority Oversampling Technique (SMOTE) and Random Undersampling (RUS) were applied to balance the datasets, and their effects on data distribution, inter-feature correlations, and machine learning model performance were compared. RUS preserved the original feature distributions but reduced representativeness by removing the majority-class samples. In contrast, SMOTE introduced synthetic samples that altered covariance structures, increasing the risk of overfitting, particularly in small datasets (n=10). These effects diminished with larger groups (n=30), partially restoring correlations between metabolites. Model performance varied across the class-balancing algorithms. Random Forest classifiers benefited from both balancing methods, with undersampling often yielding higher F1 scores, whereas Support Vector Machine models showed reduced classification performance. These findings highlight the importance of selecting class-balancing strategies based on dataset size, analytical platform, and ML algorithm in metabolomics studies.
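The two balancing strategies compared above can be sketched in a few lines. This is a simplified illustration rather than the authors' pipeline: the oversampler uses only the single nearest minority neighbour (standard SMOTE picks among k nearest neighbours), and the function names and toy data are assumptions.

```python
import random


def random_undersample(majority, minority, rng):
    """RUS: keep a random subset of the majority class, equal in size to the minority."""
    kept = rng.sample(majority, len(minority))
    return kept, list(minority)


def smote_like_oversample(minority, n_new, rng):
    """Create n_new synthetic points on segments between minority samples.

    Simplification (assumption): each base sample is paired with its single
    nearest minority neighbour instead of one of k nearest neighbours.
    Requires at least two minority samples.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [s for s in minority if s is not base]
        neighbour = min(
            others, key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s))
        )
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)
majority = [[float(i), float(i % 3)] for i in range(10)]  # toy feature vectors
minority = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]]

kept, _ = random_undersample(majority, minority, rng)
synthetic = smote_like_oversample(minority, len(majority) - len(minority), rng)
print(len(kept), len(synthetic))
```

Undersampling only ever returns observed samples, whereas every synthetic point lies on a straight line between two existing minority samples. With very few minority samples those lines dominate the class, which is one intuition for the altered covariance structure reported for small groups.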

16

netPCF: Geometry-Aware Pair Correlation Functions for Spatial Biology

Moore, J. W.; Bull, J. A.; Byrne, H. M.

2026-07-07 bioinformatics 10.64898/2026.07.02.736020 medRxiv

Top 0.7%

1.2%

Spatial organisation is a defining feature of biological systems, underpinning cellular interactions, tissue function, disease progression and therapeutic response. Identifying and quantifying spatial organisation may require methods that resolve relationships across spatial scales. The pair correlation function (PCF) quantifies spatial dependence between points across multiple length scales, but its standard Euclidean formulation is poorly suited to data defined on irregular, curved or otherwise structured domains, where tissue geometry may constrain biological organisation and distort Euclidean distances. Here, we introduce netPCF, a geometry-aware extension of the PCF for quantifying spatial organisation on complex biological domains. By representing tissue structures, anatomical surfaces and other constrained geometries as spatial networks, netPCF generalises the PCF beyond extrinsic Euclidean settings. The framework derives the expected behaviour of the statistic under complete spatial randomness using interpretable finite-support kernels, provides bootstrap-based uncertainty quantification, and includes practical criteria for assessing domain discretisation adequacy. We further extend netPCF to marked (labelled) biological data using feature kernels for categorical and continuous attributes, enabling unified analysis of cell identities, marker intensities, phenotypic states, gene expression and other quantitative features on structured domains in any spatial dimension. All methods are implemented in the open-source Python package spacenet. Synthetic studies show that netPCF recovers classical Euclidean behaviour on sufficiently resolved networks and is robust to common imaging noise. We demonstrate its utility in two biological applications. In three-dimensional imaging mass cytometry data from HER2+ breast carcinoma, netPCF separates tissue architecture-driven proximity from biologically meaningful endothelial and immune cell organisation. 
In reconstructed surfaces of developing murine embryos, netPCF identifies a transition in the Wnt1-Wnt6 relationship from short-range co-localisation at E9.5 to spatial exclusion at E11.5, a pattern of ectodermal boundary refinement not captured by prior voxel-wise co-expression analysis. Overall, netPCF provides a statistically grounded and practical framework for quantifying spatial organisation on complex biological domains.
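
To make the idea of a network-based pair statistic concrete, here is a minimal sketch in plain Python. It is not the netPCF estimator or the spacenet API, neither of which is described in enough detail above. It assumes a much simpler quantity: the share of point pairs whose shortest-path distance falls in a band, divided by the same share over all node pairs, which stands in for a uniform baseline. The names `shortest_paths` and `network_pair_ratio` and the toy path graph are hypothetical.

```python
import heapq
from itertools import combinations


def shortest_paths(graph, source):
    """Dijkstra distances from source; graph maps node -> {neighbour: edge length}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def network_pair_ratio(graph, occupied, r_lo, r_hi):
    """Share of occupied-node pairs with network distance in (r_lo, r_hi],
    divided by the share of all node pairs in that band (simplified baseline)."""
    dist = {u: shortest_paths(graph, u) for u in graph}

    def share(nodes):
        pairs = list(combinations(sorted(nodes), 2))
        hits = sum(
            1 for u, v in pairs
            if r_lo < dist[u].get(v, float("inf")) <= r_hi
        )
        return hits / len(pairs)

    baseline = share(graph)
    return share(occupied) / baseline if baseline else float("nan")


# Toy domain: a path graph 0-1-2-3-4-5 with unit-length edges.
path = {i: {} for i in range(6)}
for i in range(5):
    path[i][i + 1] = 1.0
    path[i + 1][i] = 1.0

ratio_clustered = network_pair_ratio(path, [0, 1, 2], 0.0, 1.0)
ratio_spread = network_pair_ratio(path, [0, 2, 4], 0.0, 1.0)
print(ratio_clustered, ratio_spread)
```

In this toy case a ratio above 1 indicates more short-range pairs than the uniform baseline and a ratio below 1 indicates fewer. A real analysis would also need the kernel smoothing, bootstrap intervals, and discretisation checks that the abstract mentions.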

17

AptCancerDB: A Curated Knowledgebase and Translational Discovery Platform for Anticancer Aptamers

Bajiya, N.; Singh, S.; Raghava, G. P. S.

2026-07-09 cancer biology 10.64898/2026.07.02.735999 medRxiv

Top 0.8%

1.1%

Aptamers are emerging as important molecular recognition ligands in oncology, playing significant roles in cancer diagnostics, targeted therapies, drug delivery systems, and molecular imaging. Numerous aptamers have advanced to clinical trials, indicating their potential for real-world applications; however, existing databases do not capture this clinical progress. To bridge this critical gap, we developed AptCancerDB (https://webs.iiitd.edu.in/raghava/aptcancerdb/), a comprehensive, manually curated database of experimentally verified anticancer aptamers. The current release contains 1,941 entries collected from studies published between 2000 and 2025, covering 29 cancer types, approximately 200 cancer cell lines, and direct links to 22 clinical trials. Each entry is annotated with sequence information, target details, cancer type, cell line, SELEX methodology, affinity determination data, chemical modifications, and biological activities. The dataset is dominated by ssDNA (82.7%), reflecting its superior stability and ease of synthesis, whereas ssRNA accounts for only 16.6% and appears primarily in studies targeting complex intracellular or protein-protein interactions. To facilitate structural analysis, predicted secondary structures, dot-bracket notations, specific structural elements, and minimum free energy values are also included. AptCancerDB integrates a MySQL backend with an ArcadeDB/OpenCypher-based Knowledge Graph, enabling exploration of relationships among aptamers, targets, cancer types, cell lines, and functional applications. The platform provides advanced search and browsing facilities, BLASTn-based similarity searching, and a GC calculator. Built on a modern, responsive frontend (React/TypeScript/Tailwind CSS), the platform includes a REST API for data retrieval.
By integrating fragmented experimental data into a unified cancer-focused resource, AptCancerDB supports comparative analysis, aptamer discovery, and the development of next-generation aptamer-based diagnostics and therapeutics.

Highlights
- Curated knowledge base of experimentally validated anticancer aptamers.
- AptCancerDB contains therapeutic, tumor-homing and cell-penetrating aptamers.
- Summarizes clinical progress and translational trends in anticancer aptamer research.
- Supports rational aptamer design using molecular, functional, and clinical annotations.
- Disease-focused resource for cancer diagnosis, therapy, and drug delivery.

Teaser: AptCancerDB maintains experimentally validated anticancer aptamers relevant to diagnosis, drug delivery, and therapy.
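
Two of the annotations mentioned above, GC content and dot-bracket secondary structure, are simple to compute or parse. The following is a minimal sketch in plain Python. It is not AptCancerDB's GC calculator or REST API, and the function names and example strings are illustrative assumptions rather than database entries.

```python
def gc_content(sequence):
    """Percentage of G and C among the A/C/G/T/U characters of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no nucleotide characters")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def dot_bracket_pairs(structure):
    """Map each paired position to its partner for a dot-bracket string.

    Only '(' , ')' and '.' are handled; pseudoknot brackets are out of scope.
    """
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


# Illustrative inputs only; not taken from the database.
example_seq = "GGTTGGTGTGGTTGG"
example_structure = "((..))"

example_gc = gc_content(example_seq)
example_pairs = dot_bracket_pairs(example_structure)
print(example_gc, example_pairs)
```

Restricting the denominator to recognised nucleotide characters means gaps or whitespace in a pasted sequence do not change the percentage. Whether the platform's own calculator does the same is not stated in the abstract.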

18

ReCo: a self-configuring and self-extending agentic framework for biomedical research

Tzanis, E.; Klontzas, M. E.

2026-07-16 health informatics 10.64898/2026.07.14.26358025 medRxiv

Top 0.8%

1.1%

This study presents ReCo (Research Cosmos), a self-configuring and self-extending agentic research framework for the biomedical domain. ReCo is orchestrated by a large language model that interacts with native computing tools, bundled Model Context Protocol (MCP) servers, structured skills, persistent project memory, and a desktop interface. Its bundled MCP servers provide biomedical analysis capabilities while serving as implementation paradigms for integrating new computational and AI frameworks. Structured skills encode procedures for environment configuration and framework ingestion, enabling ReCo to inspect repositories, manuscripts, or local codebases; identify dependencies and execution patterns; create isolated runtime environments; and design and implement MCP interfaces. Self-extension was evaluated using five heterogeneous systems: the Merlin computed tomography foundation model, MAISI-v2 medical image synthesis framework, asari liquid chromatography-mass spectrometry workflow, DosimeTron agentic radiation-dosimetry platform, and Orthanc DICOM server. ReCo successfully operationalized all five systems and completed predefined functional evaluations. Re-hosted DosimeTron outputs demonstrated near-perfect agreement with the reference pipeline across 651 organ observations (Pearson correlation and Lin concordance correlation coefficient, 0.99999; mean absolute percentage difference, 0.37%). Notably, ReCo configured Orthanc as a PACS-like coordination layer, integrated it with DosimeTron, Merlin, and TotalSegmentator, and orchestrated data retrieval, analysis, and return of valid DICOM RTSTRUCT, RTDOSE, and Structured Report objects. ReCo provides a unified environment for configuring, documenting, and operationalizing heterogeneous biomedical frameworks, reducing technical barriers to the adoption and integration of emerging computational and AI methods. The official open-source ReCo GitHub repository is available at: https://github.com/eltzanis/ReCo
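
The agreement metrics quoted above measure different things: Pearson correlation rewards any linear relationship, whereas Lin's concordance correlation coefficient also penalises shifts away from the identity line. For paired values $x$ and $y$ it is $\rho_c = 2 s_{xy} / (s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2)$. The sketch below is a minimal plain-Python illustration using population (1/n) moments. The data are made up for the example and are not DosimeTron outputs.

```python
from math import sqrt
from statistics import fmean


def _moments(x, y):
    """Means, population variances, and population covariance of paired data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = fmean(x), fmean(y)
    sxx = fmean([(a - mx) ** 2 for a in x])
    syy = fmean([(b - my) ** 2 for b in y])
    sxy = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return mx, my, sxx, syy, sxy


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / sqrt(sxx * syy)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population-moment form)."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


# Made-up paired measurements: the second series is the first shifted by +1.
reference = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]

r_value = pearson_r(reference, shifted)
ccc_value = lin_ccc(reference, shifted)
print(r_value, ccc_value)
```

A constant offset leaves the Pearson correlation at 1 but lowers the concordance coefficient, which is why reporting both values is informative when comparing a re-hosted pipeline with its reference.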

19

Homology-aware cross-validation strategies for generalization assessment in RNA structure prediction

Bugnon, L.; Kulemeyer, G.; Gerard, M.; Di Persia, L.; Stegmayer, G.; Milone, D. H.

2026-06-29 bioinformatics 10.64898/2026.06.28.735057 medRxiv

Top 0.9%

1.1%

RNA secondary structure prediction is a fundamental challenge in bioinformatics, essential for understanding the functional roles of non-coding RNAs. Recently, deep learning models have transformed the field with impressive results, leading to critical discussions regarding the validity of current cross-validation strategies. On the one hand, traditional random partitioning yields overoptimistic results due to data leakage from uncontrolled homology. On the other hand, removing from the training set all sequences that exhibit even the slightest resemblance to the testing sequences penalizes learning-based methods by requiring generalization to completely out-of-distribution sequences. While it is very simple to remove sequences and retrain a machine-learned model, it is very difficult to remove the experimental data used for parameter tuning and the sequences used for the development of classical thermodynamic methods. Thus, these methods often benefit from an implicit knowledge leakage. In this work we critically review existing cross-validation strategies for RNA secondary structure prediction: random splitting, clustering-based splitting, and leaving one RNA family out for testing. We analyze the advantages and limitations of each strategy and discuss how to extend them to ensure fair comparisons across the full range of sequence similarities, with the same rigor for both classical and learning-based methods.
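
The difference between random splitting and leaving one family out can be shown on a toy dataset. The sketch below is plain Python and makes several assumptions: each record carries a family label, the record fields and helper names are hypothetical, and clustering-based splitting is omitted because it needs a sequence-similarity tool.

```python
import random
from collections import defaultdict


def random_split(records, test_fraction, seed=0):
    """Shuffle records and hold out a fraction, ignoring family membership."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def leave_one_family_out(records):
    """Yield (family, train, test) with one whole family held out per fold."""
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)
    for held_out in sorted(by_family):
        train = [r for fam, recs in by_family.items() if fam != held_out for r in recs]
        yield held_out, train, by_family[held_out]


def shared_families(train, test):
    """Families present on both sides of a split (a coarse proxy for leakage)."""
    return {r["family"] for r in train} & {r["family"] for r in test}


# Toy records with placeholder identifiers; not real dataset entries.
records = [
    {"id": f"{fam}_{i}", "family": fam}
    for fam in ("tRNA", "5S_rRNA", "RNaseP")
    for i in range(4)
]

folds = list(leave_one_family_out(records))
for family, train, test in folds:
    print(family, len(train), len(test), shared_families(train, test))

rand_train, rand_test = random_split(records, test_fraction=0.25)
print("random split shares families:", shared_families(rand_train, rand_test))
```

Family-level holdout guarantees that no test family appears in training, while a random split over the same records will usually place members of one family on both sides. Family labels are only a coarse proxy for homology, which is part of the tension the abstract describes.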

20

Functional Data Analysis of Spatial Clustering Identifies Prognostic T Cell Patterns in Ovarian Cancer

Sakitis, C. J.; Liao, D.; Reid, B. M.; Townsend, M. K.; Schildkraut, J. M.; Lawson, A. B.; Tworoger, S. S.; Terry, K. L.; Peres, L. C.; Wrobel, J.; Soupir, A. C.; Fridley, B. L.

2026-07-09 cancer biology 10.64898/2026.07.02.735980 medRxiv

Top 0.9%

1.1%

Spatial proteomic imaging technologies enable the simultaneous assessment of immune cell abundance and spatial organization within the tumor microenvironment. Spatial clustering is commonly summarized using measures such as Ripley's K or nearest-neighbor G-functions at a fixed radius. However, these approaches depend on scale selection and may obscure biologically relevant patterns occurring across spatial ranges. We propose a functional data analysis (FDA) framework to model spatial clustering trajectories derived across a continuum of radii. Functional principal component analysis (FPCA) was used to summarize dominant modes of spatial variation, and resulting scores were incorporated into Cox proportional hazards models as both main effects and interactions with immune cell abundance. The approach was applied to multiplex immunofluorescence data from five ovarian cancer studies, comprising 773 high-grade ovarian serous tumors. Analyses focused on CD3+ and CD8+ T cell populations within the tumor compartment of the tissue, adjusting for age at diagnosis and cancer stage, with study-specific estimates combined using random-effects meta-analysis. Higher abundance of both CD3+ and CD8+ T cells was consistently associated with improved overall survival. Beyond abundance, spatial features captured by the leading functional principal component were independently associated with survival, particularly for CD8+ T cells. Interaction models further showed that the prognostic effect of immune infiltration depended on spatial clustering, with tumors characterized by high abundance and low spatial clustering exhibiting the most favorable outcomes. These findings indicate that spatial organization provides complementary prognostic information beyond abundance alone and suggest that more diffuse immune infiltration may reflect more effective anti-tumor activity in ovarian cancer.
Overall, FDA offers a flexible and interpretable framework for modeling spatial clustering across scales and identifying prognostic spatial features not captured by fixed-radius or distance analyses.
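
The clustering trajectories referred to above are curves of a spatial statistic evaluated over many radii rather than at one fixed radius. As a minimal sketch, the plain-Python code below evaluates a naive Ripley's K estimate, $\hat{K}(r) = \frac{A}{n(n-1)} \sum_{i \neq j} \mathbf{1}[d_{ij} \le r]$, on a grid of radii. It omits edge correction, the FPCA step, and the Cox models, and the function names and toy coordinates are assumptions for illustration, not the authors' pipeline.

```python
import math


def ripley_k(points, r, area):
    """Naive Ripley's K at radius r for 2-D points, without edge correction."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= r
    )
    return area * close_pairs / (n * (n - 1))


def k_curve(points, radii, area):
    """Evaluate K over a grid of radii, giving one trajectory per sample."""
    return [ripley_k(points, r, area) for r in radii]


# Toy sample: four cells at the corners of a unit square (area 1).
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
radii = [0.5, 1.0, 1.5]

curve = k_curve(corners, radii, area=1.0)
print(list(zip(radii, curve)))
```

Under complete spatial randomness the theoretical value is $K(r) = \pi r^2$, so curves are often compared against that reference before further analysis. Stacking one such curve per tumor into a matrix gives the kind of input on which a functional principal component analysis could then be run.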